Predictions and model objects in linear regression
python
datacamp
statistics
machine learning
linear regression
logistic regression
Author
kakamana
Published
January 8, 2023
Predictions and model objects
This article explores how linear regression models can be used to predict Taiwanese house prices and Facebook advert clicks. Along the way, we will work hands-on with model objects, explore the concept of “regression to the mean”, and learn how to transform variables within a dataset.
This is my learning experience of data science through DataCamp
Predicting house prices
Predictions can be made using statistical models like linear regression. In other words, you specify each explanatory variable, feed it into the model, and get a prediction.
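As a minimal self-contained sketch of that workflow (using a small made-up dataset rather than the course's taiwan_real_estate data), you can fit a model with ols, build a DataFrame of explanatory values, and feed it to .predict():

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data: price explained by number of convenience stores
df = pd.DataFrame({
    "n_convenience": [0, 1, 2, 3, 4, 5],
    "price_twd_msq": [7.0, 8.1, 9.3, 10.2, 11.4, 12.4],
})

# Fit the model, then predict for new explanatory values
mdl = ols("price_twd_msq ~ n_convenience", data=df).fit()
new_data = pd.DataFrame({"n_convenience": np.arange(0, 11)})
predictions = mdl.predict(new_data)
print(predictions)
```

The column name in the new DataFrame must match the explanatory variable in the formula; statsmodels matches by name, not by position.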
Code
# Import numpy with alias np
import numpy as np
# Import pandas with alias pd
import pandas as pd
# Import seaborn with alias sns
import seaborn as sns
# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
# Import the ols function
from statsmodels.formula.api import ols

# Create explanatory_data
explanatory_data = pd.DataFrame({'n_convenience': np.arange(0, 11)})
# Use mdl_price_vs_conv to predict with explanatory_data, call it price_twd_msq
price_twd_msq = mdl_price_vs_conv.predict(explanatory_data)
# Print it
print(price_twd_msq)
# Create prediction_data by adding the predictions to explanatory_data
prediction_data = explanatory_data.assign(price_twd_msq=price_twd_msq)
The prediction DataFrame you created contains a column of explanatory variable values and a column of response variable values. That means you can plot it on the same scatter plot as the response versus explanatory data values.
Code
# Create a new figure, fig
fig = plt.figure()
sns.regplot(x="n_convenience", y="price_twd_msq", data=taiwan_real_estate, ci=None)
# Add a scatter plot layer to the regplot
sns.scatterplot(x='n_convenience', y='price_twd_msq', data=prediction_data, color='red', marker='s')
# Show the layered plot
plt.show()
print("\n The predicted points lie on the trend line")

The predicted points lie on the trend line.
Extracting model elements
The model object created by ols() contains many elements. In order to perform further analysis on the model results, we need to extract its useful bits. The model coefficients, the fitted values, and the residuals are perhaps the most important pieces of the linear model object.
Code
# Print the model parameters of mdl_price_vs_conv
print(mdl_price_vs_conv.params)
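Beyond the parameters, the other model elements mentioned above can be pulled out in the same way. A hedged, self-contained sketch (on a small made-up dataset rather than the course data):

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data standing in for taiwan_real_estate
df = pd.DataFrame({
    "n_convenience": [0, 1, 2, 3, 4],
    "price_twd_msq": [7.2, 8.0, 9.1, 10.3, 11.0],
})
mdl = ols("price_twd_msq ~ n_convenience", data=df).fit()

print(mdl.params)        # coefficients: Intercept and slope
print(mdl.fittedvalues)  # predictions on the original data
print(mdl.resid)         # residuals: actual minus fitted
print(mdl.summary())     # full diagnostic printout
```

Note that .fittedvalues and .resid are pandas Series indexed the same way as the original data, so they can be assigned straight back onto the DataFrame for plotting.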
Using the model coefficients, you can manually calculate predictions. It’s better to use .predict() when making predictions in real life, but doing it manually is helpful for reassuring yourself that predictions aren’t magic.
For simple linear regressions, the predicted value is the intercept plus the slope times the explanatory variable.
response = intercept + slope * explanatory
Code
# Get the coefficients of mdl_price_vs_conv
coeffs = mdl_price_vs_conv.params
# Get the intercept
intercept = coeffs.iloc[0]
# Get the slope
slope = coeffs.iloc[1]
# Manually calculate the predictions
price_twd_msq = intercept + slope * explanatory_data
print(price_twd_msq)
# Compare to the results from .predict()
print(price_twd_msq.assign(predictions_auto=mdl_price_vs_conv.predict(explanatory_data)))
Regression to the mean

Let's explore regression to the mean using consecutive annual returns of companies in the S&P 500. Plotting 2019 returns against 2018 returns, together with the line y = x, shows how much of an extreme performance carries over from one year to the next.

Code
# Create a new figure, fig
fig = plt.figure()
# Plot the first layer: y = x
plt.axline(xy1=(0, 0), slope=1, linewidth=2, color="green")
# Add scatter plot with linear regression trend line
sns.regplot(x='return_2018', y='return_2019', data=sp500_yearly_returns, ci=None, line_kws={'color': 'black'})
# Set the axes so that the distances along the x and y axes look the same
plt.axis("equal")
# Show the plot
plt.show()
print('\n The regression trend line looks very different to the y equals x line. As the financial advisors say, "Past performance is no guarantee of future results."')
The regression trend line looks very different to the y equals x line. As the financial advisors say, "Past performance is no guarantee of future results."
Modeling consecutive returns
Let’s quantify the relationship between returns in 2019 and 2018 by running a linear regression and making predictions. By looking at companies with extremely high or extremely low returns in 2018, we can see whether their performance was similar in 2019.
Code
# Run a linear regression on return_2019 vs. return_2018 using sp500_yearly_returns
mdl_returns = ols("return_2019 ~ return_2018", data=sp500_yearly_returns).fit()
# Print the parameters
print(mdl_returns.params)
# Create a DataFrame with return_2018 at -1, 0, and 1
explanatory_data = pd.DataFrame({'return_2018': [-1, 0, 1]})
# Use mdl_returns to predict with explanatory_data
print(mdl_returns.predict(explanatory_data))
print("\n Investments that gained a lot in value in 2018 on average gained only a small amount in 2019. Similarly, investments that lost a lot of value in 2018 on average also gained a small amount in 2019")
Intercept 0.321321
return_2018 0.020069
dtype: float64
0 0.301251
1 0.321321
2 0.341390
dtype: float64
Investments that gained a lot in value in 2018 on average gained only a small amount in 2019. Similarly, investments that lost a lot of value in 2018 on average also gained a small amount in 2019
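A slope of roughly 0.02 means almost none of an extreme 2018 return carries over to 2019. This is regression to the mean in action, and it can be reproduced with simulated data where each outcome is mostly luck plus a small persistent component (all numbers here are assumptions for illustration, not real market data):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulate two years of returns that share only a weak persistent component
rng = np.random.default_rng(42)
skill = rng.normal(0, 0.05, 500)  # small persistent component
returns = pd.DataFrame({
    "return_2018": skill + rng.normal(0, 0.2, 500),  # mostly luck
    "return_2019": skill + rng.normal(0, 0.2, 500),  # fresh luck, same skill
})

mdl = ols("return_2019 ~ return_2018", data=returns).fit()
# The fitted slope is well below 1: extreme years regress toward the mean
print(mdl.params)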
Transforming the explanatory variable
When there is no straight-line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both. Let’s transform the explanatory variable.
We’ll look at the Taiwan real estate dataset again, but this time using the distance to the nearest MRT (metro) station as the explanatory variable. Taking the square root compresses the very large distances, which can straighten out the relationship with price.
Code
# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])
plt.figure()
# Plot using the transformed variable
sns.regplot(x='sqrt_dist_to_mrt_m', y='price_twd_msq', data=taiwan_real_estate)
plt.show()
Code
# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()
print(mdl_price_vs_dist.params)
# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])
# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()
explanatory_data = pd.DataFrame({
    "sqrt_dist_to_mrt_m": np.sqrt(np.arange(0, 81, 10) ** 2),
    "dist_to_mrt_m": np.arange(0, 81, 10) ** 2,
})
# Create prediction_data by adding a column of predictions to explanatory_data
prediction_data = explanatory_data.assign(
    price_twd_msq=mdl_price_vs_dist.predict(explanatory_data)
)
# Print the result
print(prediction_data)
# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])
# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()
# Use this explanatory data
explanatory_data = pd.DataFrame({
    "sqrt_dist_to_mrt_m": np.sqrt(np.arange(0, 81, 10) ** 2),
    "dist_to_mrt_m": np.arange(0, 81, 10) ** 2,
})
# Use mdl_price_vs_dist to predict explanatory_data
prediction_data = explanatory_data.assign(
    price_twd_msq=mdl_price_vs_dist.predict(explanatory_data)
)
fig = plt.figure()
sns.regplot(x="sqrt_dist_to_mrt_m", y="price_twd_msq", data=taiwan_real_estate, ci=None)
# Add a layer of your prediction points
sns.scatterplot(data=prediction_data, x='sqrt_dist_to_mrt_m', y='price_twd_msq', color='red')
plt.show()
print("\n By transforming the explanatory variable, the relationship with the response variable became linear, and so a linear regression became an appropriate model")

By transforming the explanatory variable, the relationship with the response variable became linear, and so a linear regression became an appropriate model.
Transforming the response variable too
The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, you “back transform” the predictions.
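The full cycle can be sketched end to end on a small made-up dataset (the numbers below are illustrative assumptions, not the course's ad_conversion data): transform both variables, fit on the transformed scale, predict, then undo the fourth-root transform by raising to the fourth power:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data where clicks grow sub-linearly with impressions
df = pd.DataFrame({
    "n_impressions": [1e4, 1e5, 5e5, 1e6, 2e6],
    "n_clicks": [120, 800, 2700, 4500, 7600],
})

# Transform both variables with the fourth root, fit on the transformed scale
df["qdrt_n_impressions"] = df["n_impressions"] ** 0.25
df["qdrt_n_clicks"] = df["n_clicks"] ** 0.25
mdl = ols("qdrt_n_clicks ~ qdrt_n_impressions", data=df).fit()

# Predict on the transformed scale, then back-transform with ** 4
new = pd.DataFrame({"qdrt_n_impressions": np.array([3e6]) ** 0.25})
pred_qdrt = mdl.predict(new)
pred_clicks = pred_qdrt ** 4  # back-transform to the original units
print(pred_clicks)
```

The back-transform must be the exact inverse of the forward transform: fourth root undone by fourth power, log undone by exp, and so on.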
Code
# Create qdrt_n_impressions and qdrt_n_clicks
ad_conversion["qdrt_n_impressions"] = ad_conversion['n_impressions'] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion['n_clicks'] ** 0.25
plt.figure()
# Plot using the transformed variables
sns.regplot(x='qdrt_n_impressions', y='qdrt_n_clicks', data=ad_conversion, ci=None)
plt.show()
Code
# From previous step
ad_conversion["qdrt_n_impressions"] = ad_conversion["n_impressions"] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion["n_clicks"] ** 0.25
# Run a linear regression of your transformed variables
mdl_click_vs_impression = ols('qdrt_n_clicks ~ qdrt_n_impressions', data=ad_conversion).fit()
print(mdl_click_vs_impression.summary())
# From previous steps
ad_conversion["qdrt_n_impressions"] = ad_conversion["n_impressions"] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion["n_clicks"] ** 0.25
mdl_click_vs_impression = ols("qdrt_n_clicks ~ qdrt_n_impressions", data=ad_conversion).fit()
# Use this explanatory data
explanatory_data = pd.DataFrame({
    "qdrt_n_impressions": np.arange(0, 3e6 + 1, 5e5) ** 0.25,
    "n_impressions": np.arange(0, 3e6 + 1, 5e5),
})
# Complete prediction_data
prediction_data = explanatory_data.assign(
    qdrt_n_clicks=mdl_click_vs_impression.predict(explanatory_data)
)
# Print the result
print(prediction_data)
print("\n Since the response variable has been transformed, you'll now need to back-transform the predictions to correctly interpret your result")
qdrt_n_impressions n_impressions qdrt_n_clicks
0 0.000000 0.0 0.071748
1 26.591479 500000.0 3.037576
2 31.622777 1000000.0 3.598732
3 34.996355 1500000.0 3.974998
4 37.606031 2000000.0 4.266063
5 39.763536 2500000.0 4.506696
6 41.617915 3000000.0 4.713520
Since the response variable has been transformed, you'll now need to back-transform the predictions to correctly interpret your result
In the previous section, we transformed the response variable, ran a regression, and made predictions. However, we are not yet finished! We need to perform a back-transformation in order to interpret and visualize the predictions correctly.
Code
# Back transform qdrt_n_clicks
prediction_data["n_clicks"] = prediction_data['qdrt_n_clicks'] ** 4
print(prediction_data)
# Plot the transformed variables
fig = plt.figure()
sns.regplot(x="qdrt_n_impressions", y="qdrt_n_clicks", data=ad_conversion, ci=None)
# Add a layer of your prediction points
sns.scatterplot(data=prediction_data, x='qdrt_n_impressions', y='qdrt_n_clicks', color='red')
plt.show()
print("\n Notice that your back-transformed predictions nicely follow the trend line and allow you to make more accurate predictions")
Notice that your back-transformed predictions nicely follow the trend line and allow you to make more accurate predictions